ABSTRACT

Declarative large-scale machine learning (ML) aims at the
specification of ML algorithms in a high-level language and
automatic generation of hybrid runtime execution plans
ranging from single node, in-memory computations to
distributed computations on MapReduce (MR) or similar
frameworks like Spark. The compilation of large-scale ML
programs exhibits many opportunities for automatic
optimization. Advanced cost-based optimization techniques
require, as a fundamental precondition, an accurate cost
model for evaluating the impact of optimization decisions.
In this paper, we share insights into a simple and robust
yet accurate technique for costing alternative runtime
execution plans of ML programs. Our cost model relies on
generating and costing runtime plans in order to automatically
reflect all successive optimization phases. Costing runtime
plans also captures control flow structures such as loops and
branches, and a variety of cost factors like IO, latency, and
computation costs. Finally, we linearize all these cost factors
into a single measure of expected execution time. Within
SystemML, this cost model is leveraged by several advanced
optimizers like resource optimization and global data flow
optimization. We share our lessons learned in order to
provide foundations for the optimization of ML programs.

1. INTRODUCTION 

State-of-the-art systems for large-scale ML aim at
declarative ML with high-level languages including linear
algebra, statistical functions, and ML-specific constructs. This
declarative approach allows users to write their custom ML
algorithms once, independent of the underlying runtime
framework, data or cluster characteristics. These high-level
ML programs are then automatically optimized and
compiled into hybrid in-memory and distributed runtime plans.
The major advantages of such a high-level language are the
full flexibility to specify new or customize existing ML
algorithms, physical data independence of the underlying data
representation (e.g., dense/sparse, formats, matrix
blocking), and both efficiency and scalability via automatic
cost-based optimization. There are many high impact
optimization opportunities like static and dynamic algebraic rewrites,
matrix multiplication chain optimization, decisions between
single node and distributed plans, or alternative physical
operators. However, any cost-based optimization technique
requires an accurate cost model for evaluating alternative
plans or quantifying the impact of optimization decisions.


Cost Model Requirements: There are several important
requirements on such a cost model for optimizing
large-scale ML programs which originate from potentially
distributed runtime plans and ML program characteristics.

• Analytical Cost Model (R1): We need an analytical
cost model in order to cost alternative runtime plans.
The potentially large number of alternative plans
prohibits a model relying on previous or sample runs.

• Diverse Cost Factors (R2): Large-scale ML programs
exhibit several orthogonal cost factors, each of which can
turn into a bottleneck. These include IO, latency, and
computation costs. Simple cost models like the sum of
intermediate result sizes cannot capture all of them.

• Resource Awareness (R3): The optimization of ML
programs is sensitive to available memory and
parallelism. Hence, our cost model needs to be aware of
cluster characteristics and resource configurations.

• Complex Control Flow (R4): ML programs often
contain deep control flow structures of loops, branches,
and function calls. Our cost model needs to be able to
cost arbitrarily complex programs.

In this paper, we share a simple and robust technique of
costing generated runtime plans which is the result of several
lessons we have learned applying earlier cost model versions
in real-world use cases of SystemML.

Example ML Program for Linear Regression: As
our running example, we use a simplified version of a
closed-form linear regression algorithm. Its conciseness makes it
feasible to present generated runtime plans, which are rarely
shown in the literature. The following DML script (w/
R-like syntax) solves an ordinary least squares problem y = Xβ.

1: X = read($1);
2: y = read($2);
3: intercept = $3; lambda = 0.001;
4: if( intercept == 1 ) {
5:    ones = matrix(1, nrow(X), 1);
6:    X = append(X, ones);
7: }
8: I = matrix(1, ncol(X), 1);
9: A = t(X) %*% X + diag(I)*lambda;
10: b = t(X) %*% y;
11: beta = solve(A, b);
12: write(beta, $4);

In detail, we read two matrices X and y from HDFS, where
we append a column of 1s to X if we are asked to compute
the model intercept. The core computation of this ML
program (lines 9-11) then constructs and solves a linear system
of equations with regularization λ. The size of the
intermediate results A and b is determined by the number of features.
Finally, we write the model coefficients β to HDFS.

Table 1: Overview Scenarios of Input Sizes.

Scenario         X                    y             Input Size
LinregDS, XS     10^4 x 10^3          10^4 x 1      80 MB
LinregDS, XL1    10^8 x 10^3          10^8 x 1      800 GB
LinregDS, XL2    10^8 x 2·10^3        10^8 x 1      1.6 TB
LinregDS, XL3    2·10^8 x 10^3        2·10^8 x 1    1.6 TB
LinregDS, XL4    2·10^8 x 2·10^3      2·10^8 x 1    3.2 TB
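To make the semantics of the running example concrete, the following is a minimal single-node sketch of the same closed-form computation in plain Python. It is not SystemML code: the function names `linreg_ds` and `solve` are hypothetical, matrices are plain lists of rows, and the small Gaussian elimination only stands in for the DML builtin solve.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def linreg_ds(X, y, intercept=0, lam=0.001):
    """Mirrors script lines 4-11: optional intercept column, then solve(A, b)."""
    if intercept == 1:
        X = [row + [1.0] for row in X]             # append a column of 1s
    n = len(X[0])
    # A = t(X) %*% X + diag(I) * lambda
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # b = t(X) %*% y
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    return solve(A, b)
```

Note that A is only n x n and b only n x 1, independent of the number of rows, which is why the final solve can stay on a single node even for very large inputs.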

In the rest of this paper, we discuss runtime plans
generated by SystemML for different input sizes and cluster
characteristics as well as the costing of these generated plans.
Selected details of the entire compilation chain are described
in SystemML's architecture, SystemML's optimizer,
and SystemML's parfor optimizer for task-parallel ML
programs. We leverage SystemML's text-based EXPLAIN tool
that allows us to capture plans at different compilation
levels like HOPs and runtime plans during initial compilation,
as well as HOPs and runtime plans during recompilation.

2. GENERATING RUNTIME PLANS 

In this section, we discuss the basics of generating runtime
plans in SystemML. All examples are created on a 1+6 node
cluster, i.e., one head node of 2x4 Intel E5530 @ 2.40 GHz-2.66 GHz
with hyper-threading enabled and 64 GB RAM, as
well as 6 nodes of 2x6 Intel E5-2440 @ 2.40 GHz-2.90 GHz
with hyper-threading enabled, 96 GB RAM, 12x2 TB disk
storage, 10Gb Ethernet. We used Hadoop 2.2.0 and a static
cluster configuration with 2 GB max/initial JVM heap size
for the client and map/reduce tasks. Our HDFS capacity
was 107 TB (11 disks per node), and we used an HDFS block
size of 128 MB. Finally, our default configurations of
SystemML are 12 reducers (2x number of nodes) and a memory
budget ratio of 70% of the max heap size.

Scenarios of Different Input Sizes: We use scenarios
of different input sizes in order to show the effect on runtime
plan generation. Table 1 gives an overview of five scenarios
ranging from very small to very large use cases. In detail,
this table shows the input dimensions of X and y as well
as the input data size in binary block format. Note that
we use fully dense data sets, where the number of non-zeros
is equal to the number of matrix cells. In the following, we
discuss generated runtime plans of selected scenarios.
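The input sizes of the scenarios follow directly from the dimensions of X, since the data is fully dense. As a rough sketch, assuming 8 bytes per double value and ignoring any block header overhead of the binary block format (the helper names are hypothetical):

```python
# Dimensions of X per scenario (rows, columns).
SCENARIOS = {
    "XS":  (10**4, 10**3),
    "XL1": (10**8, 10**3),
    "XL2": (10**8, 2 * 10**3),
    "XL3": (2 * 10**8, 10**3),
    "XL4": (2 * 10**8, 2 * 10**3),
}


def dense_size_bytes(rows, cols):
    """Size of a fully dense matrix, assuming 8-byte double values."""
    return rows * cols * 8


def human(nbytes):
    """Format a byte count with decimal units, as used in the scenario overview."""
    for unit, factor in (("TB", 10**12), ("GB", 10**9), ("MB", 10**6)):
        if nbytes >= factor:
            return f"{nbytes / factor:g} {unit}"
    return f"{nbytes} B"


for name, (m, n) in SCENARIOS.items():
    print(name, human(dense_size_bytes(m, n)))
```

The vector y adds comparatively little (e.g., 800 MB in scenario XL1), which is why the overview reports the size of X as the input size.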

Example HOP DAG (Scenario XS): First of all, we
have a look at generated HOP DAGs for our example ML
program, which allows a natural transition from script level
to the level of runtime plans. We use scenario XS with
input sizes of X: 10^4 x 10^3 (80 MB, dense, binary block) and
y: 10^4 x 1 (1 MB, dense) as well as an intercept 0.
Figure 1 shows the HOP EXPLAIN output (after HOP rewrites,
computation of memory estimates, and execution type
selection). There are several noteworthy modifications
compared to the original script. First, after constant folding,
the branch condition (lines 4-7) became constant and hence
was removed accordingly. Second, multiple rewrites
transformed the expression diag(matrix(1,...))*lambda into
diag(matrix(lambda,...)), which prevents one
unnecessary intermediate. Third, we propagated the input
dimension sizes over the entire program and computed the
individual operation memory estimates (input, intermediate, and
output memory requirements) accordingly. Obviously, for
sparse input data, this is more challenging. Fourth,
according to these memory estimates and the given memory
budgets (local, remote map/reduce), we selected the execution
type CP (control program), i.e., pure single node, in-memory
operations for all HOPs. Apart from persistent/transient
read/writes, the HOP DAG contains operators for transpose
(r(t)), matrix multiplication (ba(+*)), matrix construction
(dg(rand)), vector-to-diagonal matrix (r(diag)),
element-wise binary addition (b(+)), and solving a linear system of
equations (b(solve)). This program of HOP DAGs is then
compiled over LOP DAGs into a runtime program of
executable program blocks and instructions.

# Memory Budget local/remote = 1434MB/1434MB/1434MB
# Degree of Parallelism (vcores) local/remote = 24/144/72
PROGRAM
--MAIN PROGRAM
----GENERIC (lines 1-3) [recompile=false]
------(10) PRead X [1e4,1e3,1e3,1e3,1e7] [76MB] CP
------(11) TWrite X (10) [1e4,1e3,1e3,1e3,1e7] [76MB] CP
------(21) PRead y [1e4,1,1e3,1e3,1e4] [0MB] CP
------(22) TWrite y (21) [1e4,1,1e3,1e3,1e4] [0MB] CP
------(24) TWrite intercept [0,0,-1,-1,-1] [0MB] CP
------(26) TWrite lambda [0,0,-1,-1,-1] [0MB] CP
----GENERIC (lines 8-12) [recompile=false]
------(42) TRead X [1e4,1e3,1e3,1e3,1e7] [76MB] CP
------(52) r(t) (42) [1e3,1e4,1e3,1e3,1e7] [153MB] CP
------(53) ba(+*) (52,42) [1e3,1e3,1e3,1e3,-1] [168MB] CP
------(50) u(ncol) (42) [0,0,-1,-1,-1] [0MB] CP
------(71) dg(rand) (50) [1e3,1,1e3,1e3,1e3] [0MB] CP
------(54) r(diag) (71) [1e3,1e3,1e3,1e3,1e3] [0MB] CP
------(57) b(+) (53,54) [1e3,1e3,1e3,1e3,-1] [15MB] CP
------(43) TRead y [1e4,1,1e3,1e3,1e4] [0MB] CP
------(59) ba(+*) (52,43) [1e3,1,1e3,1e3,-1] [76MB] CP
------(60) b(solve) (57,59) [1e3,1,1e3,1e3,-1] [15MB] CP
------(66) PWrite beta (60) [1e3,1,-1,-1,-1] [0MB] CP

Figure 1: Example HOP DAG, Scenario XS. This
program has two program blocks, w/ one HOP DAG
per block. Every HOP shows its ID, operation,
child IDs, output sizes (number of rows/columns,
row/column block sizes, number of non-zeros),
operation memory estimate, and selected execution type.
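The execution type selection described above can be sketched as a simple comparison of an operation memory estimate against the local memory budget. This is a simplified illustration, not SystemML's implementation: the function names are hypothetical, the estimate only sums dense input and output sizes (no intermediates or sparse formats), and the budget follows the stated configuration of 70% of a 2 GB heap.

```python
MB = 1024 * 1024


def memory_budget(max_heap_bytes, ratio=0.7):
    """Memory budget as a fixed ratio of the max JVM heap size."""
    return ratio * max_heap_bytes


def dense_bytes(rows, cols):
    """Dense in-memory size, assuming 8-byte double values."""
    return 8 * rows * cols


def op_mem_estimate(inputs, output):
    """Assumed estimate: dense sizes of all inputs plus the output."""
    return sum(dense_bytes(r, c) for r, c in inputs) + dense_bytes(*output)


def select_exec_type(mem_estimate, local_budget):
    """CP (single node, in-memory) if the operation fits, otherwise MR."""
    return "CP" if mem_estimate <= local_budget else "MR"


budget = memory_budget(2048 * MB)          # roughly 1434 MB

# Matrix multiplication t(X) %*% X for scenarios XS and XL1.
xs = op_mem_estimate([(10**3, 10**4), (10**4, 10**3)], (10**3, 10**3))
xl1 = op_mem_estimate([(10**3, 10**8), (10**8, 10**3)], (10**3, 10**3))
print(select_exec_type(xs, budget), select_exec_type(xl1, budget))
```

Because the decision depends on both data sizes and the configured heap, the same script compiles to different plans on different clusters.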

Example Runtime Programs (Scenario XS): Given
the described program of HOP DAGs, we now can discuss
runtime plan generation. We first look at the small
scenario XS (80 MB) due to its simple translation. Figure 2
shows the generated runtime plan where we also see
additional optimizer choices. First, for X^T X (HOP 53), we
selected the physical operator tsmm (transpose-self matrix
multiply) in order to exploit the unary input characteristic
and the known result symmetry, which allows us to do only half
the computation. Second, we applied a specific HOP-LOP
rewrite, transforming X^T y (HOP 59) into (y^T X)^T in order
to prevent the transpose of X. This is done during LOP
construction because it exhibits additional memory constraints,
which we will discuss later in more detail. Note that we also
compile size information into the runtime plan in order to
provide operations with all available metadata.

Example Runtime Program (Scenario XL1): We
now also discuss a larger scenario XL1 (800 GB). For this
scenario, memory estimates of HOPs 52, 53, and 59 are
>1 TB, which is larger than the local memory budget of
1,434 MB and hence, we select the execution type MR for
these operators. Figure 3 shows the generated runtime plan
PROGRAM ( size CP/MR = 34/0 )
--MAIN PROGRAM
----GENERIC (lines 1-3) [recompile=false]
------CP createvar pREADX ./mboehm/cost/X false binaryblock 10000 1000 1000 1000 10000000
------CP createvar pREADy ./mboehm/cost/y false binaryblock 10000 1 1000 1000 10000
------CP assignvar 0.SCALAR.INT.true intercept.SCALAR.INT
------CP assignvar 0.0010.SCALAR.DOUBLE.true lambda.SCALAR.DOUBLE
------CP cpvar pREADX X
------CP cpvar pREADy y
----GENERIC (lines 8-12) [recompile=false]
------CP createvar _mVar2 scratch_space//_p4140352_9.1.70.96//_t0/temp1 true binaryblock 1000 1000 1000 1000 -1
------CP tsmm X.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE LEFT
------CP createvar _mVar3 scratch_space//_p4140352_9.1.70.96//_t0/temp2 true binaryblock 1000 1 1000 1000 1000
------CP rand 1000 1 1000 1000 0.0010 0.0010 1.0 -1 uniform _mVar3.MATRIX.DOUBLE
------CP createvar _mVar4 scratch_space//_p4140352_9.1.70.96//_t0/temp3 true binaryblock 1 10000 1000 1000 10000
------CP r' y.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE
------CP createvar _mVar5 scratch_space//_p4140352_9.1.70.96//_t0/temp4 true binaryblock 1000 1000 1000 1000 1000
------CP rdiag _mVar3.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE
------CP createvar _mVar6 scratch_space//_p4140352_9.1.70.96//_t0/temp5 true binaryblock 1 1000 1000 1000 -1
------CP ba+* _mVar4.MATRIX.DOUBLE X.MATRIX.DOUBLE _mVar6.MATRIX.DOUBLE
------CP createvar _mVar7 scratch_space//_p4140352_9.1.70.96//_t0/temp6 true binaryblock 1000 1000 1000 1000 -1
------CP + _mVar2.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE _mVar7.MATRIX.DOUBLE
------CP createvar _mVar8 scratch_space//_p4140352_9.1.70.96//_t0/temp7 true binaryblock 1000 1 1000 1000 -1
------CP r' _mVar6.MATRIX.DOUBLE _mVar8.MATRIX.DOUBLE
------CP createvar _mVar9 scratch_space//_p4140352_9.1.70.96//_t0/temp8 true binaryblock 1000 1 1000 1000 -1
------CP solve _mVar7.MATRIX.DOUBLE _mVar8.MATRIX.DOUBLE _mVar9.MATRIX.DOUBLE
------CP write _mVar9.MATRIX.DOUBLE ./mboehm/cost/b.SCALAR.STRING.true textcell.SCALAR.STRING.true

Figure 2: Example Runtime Plan, Scenario XS (same structure and characteristics as described for Figure 3).


PROGRAM ( size CP/MR = 29/1 )
--MAIN PROGRAM
----GENERIC (lines 1-3) [recompile=false]
------CP createvar pREADX ./mboehm/cost/X false binaryblock 100000000 1000 1000 1000 100000000000
------CP createvar pREADy ./mboehm/cost/y false binaryblock 100000000 1 1000 1000 100000000
------CP assignvar 0.SCALAR.INT.true intercept.SCALAR.INT
------CP assignvar 0.0010.SCALAR.DOUBLE.true lambda.SCALAR.DOUBLE
------CP cpvar pREADX X
------CP cpvar pREADy y
----GENERIC (lines 8-12) [recompile=true]
------CP createvar _mVar2 scratch_space//_p4149973_9.1.70.96//_t0/temp1 true binaryblock 1000 1 1000 1000 1000
------CP rand 1000 1 1000 1000 0.0010 0.0010 1.0 -1 uniform _mVar2.MATRIX.DOUBLE
------CP createvar _mVar3 scratch_space//_p4149973_9.1.70.96//_t0/temp2 true binaryblock 100000000 1 1000 1000 100000000
------CP partition y.MATRIX.DOUBLE _mVar3.MATRIX.DOUBLE ROW_BLOCK_WISE_N
------CP createvar _mVar4 scratch_space//_p4149973_9.1.70.96//_t0/temp3 true binaryblock 1000 1000 1000 1000 1000
------CP rdiag _mVar2.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE
------CP createvar _mVar5 scratch_space//_p4149973_9.1.70.96//_t0/temp4 true binaryblock 1000 1000 1000 1000 -1
------CP createvar _mVar6 scratch_space//_p4149973_9.1.70.96//_t0/temp5 true binaryblock 1000 1 1000 1000 -1
------MR-Job[
--------  jobtype        = GMR
--------  input labels   = [X, _mVar3]
--------  recReader inst =
--------  rand inst      =
--------  mapper inst    = MR tsmm 0.MATRIX.DOUBLE 2.MATRIX.DOUBLE LEFT,
                           MR r' 0.MATRIX.DOUBLE 3.MATRIX.DOUBLE,
                           MR mapmm 3.MATRIX.DOUBLE 1.MATRIX.DOUBLE 4.MATRIX.DOUBLE RIGHT_PART false
--------  shuffle inst   =
--------  agg inst       = MR ak+ 2.MATRIX.DOUBLE 5.MATRIX.DOUBLE true NONE,
                           MR ak+ 4.MATRIX.DOUBLE 6.MATRIX.DOUBLE true NONE
--------  other inst     =
--------  output labels  = [_mVar5, _mVar6]
--------  result indices = ,5,6
--------  num reducers   = 12
--------  replication    = 1 ]
------CP createvar _mVar7 scratch_space//_p4149973_9.1.70.96//_t0/temp6 true binaryblock 1000 1000 1000 1000 -1
------CP + _mVar5.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE _mVar7.MATRIX.DOUBLE
------CP createvar _mVar8 scratch_space//_p4149973_9.1.70.96//_t0/temp7 true binaryblock 1000 1 1000 1000 -1
------CP solve _mVar7.MATRIX.DOUBLE _mVar6.MATRIX.DOUBLE _mVar8.MATRIX.DOUBLE
------CP write _mVar8.MATRIX.DOUBLE ./mboehm/cost/b.SCALAR.STRING.true textcell.SCALAR.STRING.true

Figure 3: Example Runtime Plan, Scenario XL1 (simplified runtime plan, where we removed rmvar (remove
variable) instructions which follow directly after the last usage of related intermediates; instructions show
their execution type, operation code, input variables, output variable, and instruction-specific arguments).
that accordingly includes a generated MR-job instruction.
There are again several interesting decisions being made
here. First, we generated a hybrid runtime plan, where only
operations on X are scheduled to MR while all other
operations remain in CP. Second, we see important operator
selection decisions. For X^T X (HOP 53), we selected again a
tsmm MR operator but with final aggregation (ak+, aggregate
Kahan plus) in order to aggregate partial mapper results.
This aggregation instruction is transparently used in the
combiner as well. For X^T y (HOP 59), we selected a so-called
mapmm (broadcast matrix multiplication), which broadcasts
the smaller input through distributed cache. Similar to
tsmm, we also have a final aggregation for this operator.
Third, in contrast to Scenario XS, we did not apply the
(y^T X)^T rewrite and hence also execute the transpose as an
MR instruction. The reason for this is that the new
transpose of y would exceed the local memory budget and hence
spawn an individual MR job with related latency. Fourth,
we see that our piggybacking algorithm (that packs MR
operations into a minimal number of MR jobs) was able to pack
all these operations into a single MR job, which (1) shares the
scan of X, and (2) prevents the materialization of X^T. Fifth,
we decided for a CP partitioning operation of the broadcast
y in order to reduce unnecessarily large costs for reading y
into every task (w/o partitioning and w/o JVM reuse, we
would read 800 MB per task input split of 128 MB).
Partitions (of 32 MB) are read on demand but never evicted to
prevent repeated partition reads.

Discussion Further Runtime Plans (Scenarios
XL2/XL3/XL4): We now discuss the even larger scenarios
XL2-XL4, which all require the optimizer to generate runtime
plans that exhibit very different characteristics than XL1.
First, in scenario XL2, X has 2,000 columns, which is larger
than the configured block size of 1,000. This prevents the
optimizer from selecting a map-side tsmm operator because
it needs to see entire rows of the input matrix. We select
a cpmm operator instead, which requires two MR jobs.
This implies that we have to shuffle X and that the matrix
multiplication runs with a smaller degree of parallelism. Piggybacking
now also replicates the transpose of X into both jobs in order
to prevent materializing the intermediate of X^T. Second, for
scenario XL3, X and y have 2·10^8 rows. This means that
y is already 1.6 GB, which is larger than the given map-task
memory budget of 1,434 MB and hence we generate a cpmm
instead of the mapmm. Similar to scenario XL2, this leads to
three MR jobs. Note that this decision is very sensitive to
the cluster configuration (memory budget of map tasks in
this case) and there are many operators that exhibit
similar memory or block size constraints. Third, scenario XL4
combines the characteristics of XL2 and XL3, which leads
to cpmm operators for both matrix multiplications, but
piggybacking generated again just three MR jobs because both
aggregations are packed into a shared job.
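The two constraints that drive these plan changes can be summarized in a small decision sketch. It only encodes the two rules stated in the text (block size for tsmm, map-task memory budget for mapmm); the function names are hypothetical, and the real compiler considers further operators and constraints.

```python
MB = 1024 * 1024


def select_xtx_operator(ncol, block_size=1000):
    """Map-side tsmm needs entire rows of X within a single block."""
    return "tsmm" if ncol <= block_size else "cpmm"


def select_xty_operator(nrow_y, map_budget_bytes):
    """mapmm broadcasts y, so the dense vector must fit the map-task budget."""
    y_bytes = 8 * nrow_y                   # dense column vector of doubles
    return "mapmm" if y_bytes <= map_budget_bytes else "cpmm"


budget = 1434 * MB
scenarios = {                              # (rows, columns) of X
    "XL1": (10**8, 10**3),
    "XL2": (10**8, 2 * 10**3),
    "XL3": (2 * 10**8, 10**3),
    "XL4": (2 * 10**8, 2 * 10**3),
}
plans = {name: (select_xtx_operator(n), select_xty_operator(m, budget))
         for name, (m, n) in scenarios.items()}
print(plans)
```

Under these assumptions, only XL1 keeps both map-side operators, XL2 and XL3 each fall back to cpmm for one of the multiplications, and XL4 does so for both.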

To summarize, even for a very simple script, we see major
plan changes for different data sizes and cluster
characteristics. Optimization decisions of several compilation steps
affect each other and contribute to the final runtime plan.
The bottom line is that only generated runtime plans
include all required information to evaluate cost factors like
IO, latency, and computation costs. It is important to note
that generating runtime plans from HOP DAGs is rather
efficient (<0.5 ms for common DAG sizes), which makes the
generation and costing of runtime plans feasible.


3. COSTING RUNTIME PLANS 

In this section, we now discuss how to cost generated
runtime plans, which automatically reflects all optimization
phases. Given a runtime plan P (with size information), we
use a white-box cost model to compute the costs C(P, cc)
as estimated execution time of P given the cluster
configuration cc. This time-based model allows us to linearize IO,
latency, and computation costs into a single cost measure
(see R2). In contrast to related work of MR job tuning,
it also gives us an analytical cost model for entire ML
programs (see R1 and R4) because it does not rely on profiling
runs, and the runtime plan covers the entire control flow
as well. Finally, our approach is also aware of available
resources (see R3) because the compiler already respects all
memory constraints when generating runtime plans, and we
explicitly take the degree of parallelism into account.

3.1 Basic Notation 

Before we can describe the actual cost estimator
skeleton, we need some basic notation. The runtime plan P
consists of a hierarchy of program blocks b_i ∈ B and
instructions inst_i ∈ I. A matrix X is described by size
information of rows m, columns n, and sparsity s. We define
s = nnz(X)/(m · n), where nnz denotes the number of
non-zero values. This information allows us to compute size
estimates of in-memory matrices M(X) and serialized matrices
M'(X) (e.g., on local disk or HDFS). Furthermore, let k_l,
k_m, and k_r denote the degree of parallelism of the local
control program, available map slots, and available reduce slots,
respectively. In case of YARN clusters, we correct k_m and
k_r according to the available virtual cores and memory
resources of the cluster. Finally, let T(P) denote the estimated
execution time of runtime plan P, which is eventually used
as cost measure with C(P, cc) = T(P).

3.2 Cost Estimator Skeleton 

The skeleton of our cost estimator recursively scans the
runtime plan in execution order and tracks live variables
including their sizes and in-memory state. During this
single pass over the runtime program, we also compute time
estimates per instruction and aggregate these estimates
according to the program structure.

Tracking Live Variable States: Tracking sizes and
in-memory state of variables is a fundamental precondition for
costing individual instructions. We start with an empty
symbol table. While costing the runtime plan, we
maintain live variable statistics in this table. First, for each
createvar (creates metadata handle for a matrix variable),
cpvar (binds a variable to a variable name), rmvar (removes
a variable), and data generating instructions like rand or
seq, we accordingly modify our live variable statistics (e.g.,
size information). Second, we also maintain in-memory state
of variables. Persistent read inputs and MR job outputs
are known to be on HDFS, while all in-memory instructions
change the state of their inputs and output to in-memory.
This state maintenance allows us to correctly reflect required
IO costs. For example, if a persistent dataset is used by two
in-memory instructions, only the first instruction will pay
the costs of reading the input. This approach also allows
us to reason about hybrid runtime plans of CP/MR
instructions, where intermediates are exchanged via HDFS.
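The live variable tracking can be illustrated with a small symbol table that records sizes and in-memory state. The following is a minimal sketch under simplifying assumptions, not the actual estimator: the class and method names are hypothetical, matrices are assumed dense with 8 bytes per value, and only read costs of CP instructions are modeled.

```python
from dataclasses import dataclass


@dataclass
class VarStats:
    rows: int
    cols: int
    in_memory: bool = False        # False: variable currently resides on HDFS


class LiveVariables:
    """Symbol table of live variables maintained while scanning the plan."""

    def __init__(self):
        self.table = {}

    def createvar(self, name, rows, cols):
        self.table[name] = VarStats(rows, cols)

    def cpvar(self, src, dst):
        self.table[dst] = self.table[src]      # both names share one handle

    def rmvar(self, name):
        self.table.pop(name, None)

    def read_io_time(self, name, bandwidth_bytes_per_s):
        """IO time charged to a CP instruction that reads `name`."""
        v = self.table[name]
        if v.in_memory:
            return 0.0                         # already read by an earlier instruction
        v.in_memory = True                     # CP instructions pull inputs in memory
        return 8 * v.rows * v.cols / bandwidth_bytes_per_s


lv = LiveVariables()
lv.createvar("pREADX", 10**4, 10**3)
lv.cpvar("pREADX", "X")
first = lv.read_io_time("X", 150e6)            # first use pays the read
second = lv.read_io_time("X", 150e6)           # subsequent use is free
```

Sharing one handle between pREADX and X mirrors the cpvar semantics: once the data has been read under either name, later instructions see it as in-memory.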

Time Estimate Aggregation over Control Flow:
Finally, we aggregate time estimates as we recursively iterate
over the program structure. Similar to statistics aggregation
in the parfor optimizer for task-parallel ML programs, we
aggregate the time estimate of a program block b by the
sum of its children c(b) (predicates, included program blocks,
instructions) due to their sequential execution with:


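$$
T(b) = w_b \sum_{\forall c_i \in c(b)} T(c_i), \qquad
w_b = \begin{cases}
\lceil N/k \rceil & \text{parfor} \\
N & \text{for, while} \\
1/|c(b)| & \text{if} \\
1 & \text{otherwise.}
\end{cases}
\qquad (1)
$$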


For conditional branches, the aggregate is a weighted sum
of time estimates for the individual branches. For loops, we
scale the time aggregate by the number of iterations; if the
number of iterations is unknown (e.g., for while loops) we
use a constant N which at least reflects that the body is
executed multiple times. Note that we use additional
corrections in order to account for overestimated read costs in
loops, where only the first iteration reads persistent inputs.
Furthermore, we also maintain function call stacks in order
to prevent cycles when costing recursive functions.

This cost estimator skeleton allows the costing of
arbitrarily complex runtime plans including control flow
structures. The actual time estimation problem then boils down
to estimating execution time of a single instruction given the
size and in-memory state of its input and output variables.
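The recursive aggregation of Equation (1) can be sketched as follows. This is an illustrative simplification: the `Block` class, the constant `DEFAULT_N`, and the flat treatment of if-block children as branches are assumptions, and the read-cost corrections and function call stacks mentioned above are omitted.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

DEFAULT_N = 10   # hypothetical constant for loops with unknown iteration count


@dataclass
class Block:
    kind: str                              # "inst", "generic", "for", "while", "parfor", "if"
    time: float = 0.0                      # time estimate, used only for kind == "inst"
    children: List["Block"] = field(default_factory=list)
    iterations: Optional[int] = None       # None if unknown at compile time


def weight(b: Block, k: int) -> float:
    """Weight w_b of a program block, following the cases of Equation (1)."""
    n = b.iterations if b.iterations is not None else DEFAULT_N
    if b.kind == "parfor":
        return math.ceil(n / k)            # iterations spread over k workers
    if b.kind in ("for", "while"):
        return n
    if b.kind == "if":
        return 1.0 / len(b.children)       # equally weighted children
    return 1.0


def estimate(b: Block, k: int = 1) -> float:
    """T(b) = w_b * sum of T(c_i) over all children c_i of b."""
    if b.kind == "inst":
        return b.time
    return weight(b, k) * sum(estimate(c, k) for c in b.children)


body = Block("generic", children=[Block("inst", 1.0), Block("inst", 2.0)])
loop = Block("for", children=[body], iterations=5)
print(estimate(loop))
```

A single pass over the block hierarchy is sufficient because every block only needs the aggregated estimates of its direct children.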


3.3 Time Estimates of Instructions 

In general, we compute the time estimate of an instruction
as the sum of latency, IO, and computation time based on
its input and output statistics. Earlier versions of this cost
model relied on profiled and trained cost functions for
individual instructions. In contrast, we now use a
white-box cost model based on IO bandwidth multipliers and
operation-specific floating point operations in order to
remove the need for cluster-specific profiling runs.

Costing CP Instructions: The time estimate of a CP
instruction consists of IO and compute time. We estimate
IO time based on variable state, size, format, and default
format-specific IO bandwidths. If the state of an input is
in-memory, then there is no IO time; otherwise, we compute
the IO time via the serialized, format-specific size M'(X) of
this input. For example, given a 10^4 x 10^3 dense matrix in
binary block format, we get M'(X) = 80 MB; by
weighting this with the single-threaded read bandwidth for binary
block (150 MB/s), we get an IO time of 0.53 s. Compute time
is estimated as the maximum of main memory IO (computed
via main memory bandwidth multipliers) and
instruction-specific models of required floating point operations. For
example, let us use the tsmm (transpose-self matrix
multiplication) instruction for X^T X that we introduced earlier. Its
floating point requirements are estimated as follows:


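$$
\text{FLOP}(\text{tsmm}) = \begin{cases}
\text{MMD\_corr} \cdot m \cdot n^2 \cdot s & \text{dense} \\
\text{MMS\_corr} \cdot m \cdot n^2 \cdot s^2 & \text{sparse}
\end{cases}
\qquad (2)
$$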


Finally, we convert the required flops into expected
execution time assuming 1 FLOP per cycle. For example, for X:
10^4 x 10^3, MMD_corr = 0.5 (operation-specific correction),
and a 2 GHz processor, we get T(inst) = 0.5 · 10^10 / (2 · 10^9) =
2.5 s. Note that our cost model consists of dozens of these
white-box cost functions for all existing instructions.
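The two worked numbers for the CP tsmm instruction can be reproduced with a few lines of arithmetic. This sketch only covers the dense case and the constants given in the text (150 MB/s read bandwidth, MMD_corr = 0.5, 2 GHz, 1 FLOP per cycle); the function names are hypothetical, and the main memory bandwidth term of the compute time is left out.

```python
def io_time(rows, cols, in_memory, bandwidth_bytes_per_s=150e6):
    """Read time of a dense binary block matrix; zero if already in memory."""
    if in_memory:
        return 0.0
    return 8 * rows * cols / bandwidth_bytes_per_s


def tsmm_flops_dense(m, n, s=1.0, mmd_corr=0.5):
    """Dense case of Equation (2): MMD_corr * m * n^2 * s."""
    return mmd_corr * m * n**2 * s


def compute_time(flops, clock_hz=2e9, flops_per_cycle=1):
    """Convert floating point operations into seconds."""
    return flops / (clock_hz * flops_per_cycle)


t_io = io_time(10**4, 10**3, in_memory=False)
t_compute = compute_time(tsmm_flops_dense(10**4, 10**3))
print(round(t_io, 2), t_compute)
```

Summing both terms gives the time estimate of the instruction when X is still on HDFS; for a second instruction on the same X, the IO term drops to zero.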

Costing MR-Job Instructions: The time estimate of
an MR-job instruction is more complex. It consists of job
and task latency, write times for in-memory variable export,
map task read, compute, and write times, shuffle time, as
well as reduce task read, compute, and write times. The
individual IO times and computation times are estimated
similar to CP instructions, but weighted with the degree of
parallelism of map/reduce tasks. Note that costing needs to
take the structure of the MR job into account. For
example, consider a map-only job with a single mapmm instruction
without final aggregation for X v. This job will incur job
and task latency as well as map read costs for X and v, the
matrix-vector computation costs, and finally the map result
write costs. The sum of these map-side costs is divided by
the effective degree of parallelism, which is computed via a
scaled minimum of k_m (available parallelism) and number of
tasks (M'(X) divided by the HDFS block size). On YARN
clusters, we also take the CP/MR memory resources into
account when computing the degree of parallelism.
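The map-only example can be turned into a small cost function. This is a sketch under stated assumptions rather than the actual estimator: the function names and the `scale` parameter are hypothetical, the exact form of the "scaled minimum" is not specified in the text, and the read, compute, and write times are taken as totals over all tasks.

```python
import math


def num_map_tasks(serialized_bytes, hdfs_block_bytes=128 * 1024**2):
    """Number of map tasks: serialized input size divided by the HDFS block size."""
    return math.ceil(serialized_bytes / hdfs_block_bytes)


def effective_parallelism(k_m, num_tasks, scale=1.0):
    """Assumed form of the scaled minimum of available slots and task count."""
    return max(1.0, min(scale * k_m, num_tasks))


def map_only_job_time(job_latency, task_latency, num_tasks,
                      read_time, compute_time, write_time, k_m):
    """Job latency plus map-side costs divided by the effective parallelism."""
    k_eff = effective_parallelism(k_m, num_tasks)
    map_side = task_latency * num_tasks + read_time + compute_time + write_time
    return job_latency + map_side / k_eff


tasks = num_map_tasks(256 * 1024**2)           # two input splits
total = map_only_job_time(job_latency=20.0, task_latency=1.5, num_tasks=tasks,
                          read_time=4.0, compute_time=6.0, write_time=1.0,
                          k_m=144)
print(tasks, total)
```

For small inputs the number of tasks limits the parallelism, so the fixed job latency dominates, which is one reason why small operations are better kept in CP.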

3.4 Examples Runtime Plan Costing 

Putting it all together, we now revisit the example runtime
plans from Section 2 and discuss their costing in detail.

Example Plan Costing (Scenario XS): Figure 4
shows a simplified runtime plan for scenario XS (80 MB)
with annotated costs. Due to the simple program
structure, the total plan execution time of 3.31 s is computed as
the plain sum of all instruction costs (which we show as a
breakdown of IO and compute time). There are a couple of
interesting observations to make. First, the instruction that
uses a persistent input first pays the related IO costs (e.g.,
tsmm and r'), while subsequent operations on the same data
(e.g., ba+*) only account for compute time. Second, we
see that the computation time for tsmm dominates the total
execution time. The next heavy hitters are the initial
read of X as well as the computation costs of solve.

Example Plan Costing (Scenario XL1): As stated
before, costing plans that include MR-job instructions is
more challenging than pure CP runtime plans. Figure 5


PROGRAM                                 # total cost C=3.31s
--MAIN PROGRAM                          # C=3.31s
----GENERIC (lines 1-3)                 # C=2.8E-8s
------CP createvar pREADX binaryblock   # C=[0s, 4.7E-9s]
------CP createvar pREADy binaryblock   # C=[0s, 4.7E-9s]
------CP assignvar intercept            # C=[0s, 4.7E-9s]
------CP assignvar lambda               # C=[0s, 4.7E-9s]
------CP cpvar pREADX X                 # C=[0s, 4.7E-9s]
------CP cpvar pREADy y                 # C=[0s, 4.7E-9s]
----GENERIC (lines 8-12)                # C=3.31s
------CP createvar _mVar2               # C=[0s, 4.7E-9s]
------CP tsmm X _mVar2 LEFT             # C=[0.51s, 2.32s]
------CP createvar _mVar3               # C=[0s, 4.7E-9s]
------CP rand 1000 1 _mVar3             # C=[0s, 3.7E-6s]
------CP createvar _mVar4               # C=[0s, 4.7E-9s]
------CP r' y _mVar4                    # C=[5E-4s, 5E-6s]
------CP createvar _mVar5               # C=[0s, 4.7E-9s]
------CP rdiag _mVar3 _mVar5            # C=[0s, 4.7E-7s]
------CP createvar _mVar6               # C=[0s, 4.7E-9s]
------CP ba+* _mVar4 X _mVar6           # C=[0s, 0.00465s]
------CP createvar _mVar7               # C=[0s, 4.7E-9s]
------CP + _mVar2 _mVar5 _mVar7         # C=[0s, 4.7E-4s]
------CP createvar _mVar8               # C=[0s, 4.7E-9s]
------CP r' _mVar6 _mVar8               # C=[0s, 4.7E-7s]
------CP createvar _mVar9               # C=[0s, 4.7E-9s]
------CP solve _mVar7 _mVar8 _mVar9     # C=[0s, 0.466s]
------CP write _mVar9 textcell          # C=[1E-6s, 2E-4s]

Figure 4: Simplified Plan Scenario XS w/ Costs.
PROGRAM                                 # total cost C=606.9s
--MAIN PROGRAM                          # C=606.9s
----GENERIC (lines 1-3)                 # C=2.8E-8s
------CP createvar pREADX binaryblock   # C=[0s, 4.7E-9s]
------CP createvar pREADy binaryblock   # C=[0s, 4.7E-9s]
------CP assignvar intercept            # C=[0s, 4.7E-9s]
------CP assignvar lambda               # C=[0s, 4.7E-9s]
------CP cpvar pREADX X                 # C=[0s, 4.7E-9s]
------CP cpvar pREADy y                 # C=[0s, 4.7E-9s]
----GENERIC (lines 8-12)                # C=606.9s
------CP createvar _mVar2 binaryblock   # C=[0s, 4.7E-9s]
------CP rand 1000 1 _mVar2             # C=[0s, 3.7E-6s]
------CP createvar _mVar3               # C=[0s, 4.7E-9s]
------CP partition y _mVar3             # C=[10.2s, 6.4s]
------CP createvar _mVar4               # C=[0s, 4.7E-9s]
------CP rdiag _mVar2 _mVar4            # C=[0s, 4.7E-7s]
------CP createvar _mVar5               # C=[0s, 4.7E-9s]
------CP createvar _mVar6               # C=[0s, 4.7E-9s]
------MR-Job[ # nmap=5967 nred=1        # C=[589.8s]
--------  jobtype = GMR                 # latency=[144.5s]
--------  inputs  = [X, _mVar3]         # hdfsread=[70.7s]
--------  map     = MR tsmm 0 2,        # mapexec=[324.7s]
                    MR r' 0 3,
                    MR mapmm 3 1 4      # dcread=[12.6s]
--------  shuffle =                     # shuffle=[19.7s]
--------  agg     = MR ak+ 2 5,         # redexec=[11.1s]
                    MR ak+ 4 6
--------  outputs = [_mVar5, _mVar6]    # hdfswrite=[0.1s]
--------  ret ix  = ,5,6
--------  repl    = 1 ]
------CP createvar _mVar7               # C=[0s, 4.7E-9s]
------CP + _mVar5 _mVar4 _mVar7         # C=[0.05s, 5E-4s]
------CP createvar _mVar8               # C=[0s, 4.7E-9s]
------CP solve _mVar7 _mVar6 _mVar8     # C=[5E-5s, 0.466s]
------CP write _mVar8 textcell          # C=[1E-6s, 2E-4s]

Figure 5: Simplified Plan Scenario XL1 w/ Costs.


shows the simplified runtime plan of scenario XLl (800 GB) 
with annotated costs. In comparison to scenario XS, there 
are many additional cost factors. First, cost estimates of CP 
instructions automatically adapt to the increased data sizes 
and additional operators. For example, now the partition 
instruction pays the 10.2 s costs for the initial read of y. 
Second, the total execution time of 606.9 s is dominated by 
the costs of 589.8 s for the generated MR job. Several cost 
factors contribute to this estimate. The total estimated 
latency includes 20 s job latency plus 1.5 s task latency for 
each map/reduce task, normalized by the effective map and 
reduce degree of parallelism. Furthermore, the HDFS read 
costs reflect reading all map inputs, again normalized by the 
effective degree of parallelism. The major cost factor of 
this compute-intensive job, however, is the map compute time, 
which is dominated by tsmm. Additional cost factors include 
the read from distributed cache for the partitioned broadcast 
in mapmm, shuffle IO time, reduce compute time (for the final 
aggregations), and the final HDFS write of A and b. Here, the 
shuffle time captures map write, actual shuffle, and reduce 
write/read. Third, despite the same remaining instructions 
as in scenario XS, we see slightly different costs (e.g., for 
+ and solve) because by tracking the in-memory state of 
variables, we automatically take hybrid runtime plans with 
data exchange over HDFS into account as well. 
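
The linearization of cost factors described above can be illustrated with a minimal Python sketch. The per-instruction numbers are copied from Figure 5; the function names, the dictionary layout, and the exact normalization in mr_latency are assumptions made for illustration only and are not taken from the SystemML code base. In particular, the effective degrees of parallelism (k_map, k_red) are not stated in this section, so the latency function is shown with hypothetical inputs.

```python
# Per-instruction estimates from Figure 5 as (io_seconds, compute_seconds).
# Only instructions with non-negligible costs are listed; the createvar
# instructions (about 4.7E-9 s each) do not change the rounded total.
CP_COSTS = {
    "partition": (10.2, 6.4),
    "+": (0.05, 5e-4),
    "solve": (5e-5, 0.466),
    "write": (1e-6, 2e-4),
}
MR_JOB_COST = 589.8  # total estimate of the generated MR job (Figure 5)


def block_cost(cp_costs, mr_job_costs):
    """Linearize IO, compute, and MR job costs into one time estimate."""
    cp_total = sum(io + compute for io, compute in cp_costs.values())
    return cp_total + sum(mr_job_costs)


def mr_latency(num_map, num_red, k_map, k_red,
               job_latency=20.0, task_latency=1.5):
    """Job latency plus per-task latency, normalized by parallelism.

    k_map and k_red are the effective map/reduce degrees of parallelism;
    how they are derived from the cluster configuration is not shown here.
    """
    return job_latency + task_latency * (num_map / k_map + num_red / k_red)


print(round(block_cost(CP_COSTS, [MR_JOB_COST]), 1))  # 606.9
print(mr_latency(100, 10, k_map=50, k_red=10))        # 24.5 (hypothetical job)
```

Summing the annotated costs of the second program block reproduces the reported total of 606.9 s, which shows that the MR job accounts for nearly all of the estimate.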

Regarding cost model accuracy, in both examples, the 
estimated costs were within 2x of the actual execution time. 
Due to simplifying assumptions and fundamental limitations, 
this is not guaranteed in general. However, this cost model 
allows for reasonable cost comparisons of complex ML programs 
without the need for profiling or sample runs. 


3.5 Limitations 

The presented cost model works very well in practice. 
However, there are also fundamental limitations. 

Unknown Size Information: Despite techniques for 
propagating size information of dimensions and sparsity, 
there are cases where we cannot determine the sizes 
of intermediates during initial compilation. In this case, 
the compiler falls back to conservative but scalable plans in 
order to ensure plan validity. However, apart from MR job 
latency, we cannot fully infer IO and computation costs of 
affected operators in those cases, which potentially leads to 
large underestimation. This issue is commonly addressed 
by making the optimizer that uses the cost model aware of 
unknowns, which can often even be used for pruning. 

Buffer Pool Behavior: Our cost model only partially 
considers buffer pool evictions which may contribute to the 
overall program costs. In order to fully address this, we 
would need a white-box model of the buffer pool eviction 
algorithm and extend the tracking of live variables. For 
the sake of simplicity, we currently view the buffer pool as 
a black box and only consider its total size. In practice, this 
is acceptable since buffer pool evictions usually account for 
a small fraction of the total execution time. 

Unknown Conditional Control Flow: Many ML programs 
contain conditional control flow in terms of loops with an 
unknown number of iterations, branches, and recursive 
function calls. Especially for convergence-based ML algorithms, 
the number of iterations until convergence is generally 
unknown. Our heuristic of predefined constants clearly can 
fail there but at least reflects that the loop body is executed 
repeatedly. This already allows for optimization techniques 
like code motion or caching decisions. There is also existing 
work on estimating the number of iterations until 
convergence, which is an interesting direction for future extensions. 
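
The effect of the predefined-constant heuristic on loop costing can be sketched as follows. This is a minimal Python illustration under stated assumptions: the classes Block and Loop, the function cost, and the constant DEFAULT_ITERATIONS = 10 are all hypothetical and are not the names or values used by SystemML.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical constant used when the number of iterations is unknown.
DEFAULT_ITERATIONS = 10


@dataclass
class Block:
    """Leaf program block with an already estimated execution time."""
    seconds: float


@dataclass
class Loop:
    """Loop over child nodes; iterations is None if unknown at compile time."""
    body: List[object]
    iterations: Optional[int] = None


def cost(node):
    """Recursively cost a plan, scaling loop bodies by the iteration count."""
    if isinstance(node, Block):
        return node.seconds
    if isinstance(node, Loop):
        n = node.iterations if node.iterations is not None else DEFAULT_ITERATIONS
        return n * sum(cost(child) for child in node.body)
    raise TypeError(f"unsupported plan node: {node!r}")


invariant, variant = Block(5.0), Block(2.0)
inside = cost(Loop([invariant, variant]))       # invariant work repeated
hoisted = cost(invariant) + cost(Loop([variant]))  # invariant work moved out
print(inside, hoisted)  # 70.0 25.0
```

Even though the constant may be far from the true iteration count, the plan with the loop-invariant block hoisted out of the loop is costed lower, which is what allows the optimizer to prefer code motion.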

4. CONCLUSIONS 

To summarize, our simple and robust cost model allows 
the costing of generated runtime plans for ML programs. 
This model automatically reflects all optimization decisions 
of the entire compilation chain. Most importantly, it 
provides an analytical cost model for alternative plans without 
the need for profiling or sample runs. It also captures all 
relevant cost factors, is aware of data and cluster 
characteristics, and can be used for arbitrarily complex ML programs. 